Abstract: In this paper it is shown how to map a data manifold into a simpler
form by progressively discarding small degrees of freedom. This is the key to
self-organising data fusion, where the raw data is embedded in a very
high-dimensional space (e.g. the pixel values of one or more images), and the
requirement is to isolate the important degrees of freedom which lie on a
low-dimensional manifold. A useful advantage of the approach used in this paper
is that the computations are arranged as a feed-forward processing chain, where
all the details of the processing in each stage of the chain are learnt by
self-organisation. This approach is demonstrated using hierarchically correlated
data, which causes the processing chain to split the data into separate
processing channels, and then to progressively merge these channels wherever
they are correlated with each other. This is the key to self-organising data
fusion.

1 Introduction

The aim of this paper is to illustrate an approach that maps raw data into a
representation that reveals its internal structure. The raw data is a
high-dimensional vector of sample values output by a sensor, such as the samples
of a time series or the pixel values of an image, and the representation is
typically a lower dimensional vector that retains some or all of the information
content of the raw data. There are many ways of achieving this type of data
reduction, and this paper will focus on methods that learn from examples of the
raw data alone.

A key approach to data reduction is the self-organising map (SOM) [?].
There are many variants of the SOM approach which may be used to map raw
data into a lower dimensional space that retains some or all of its information
content. In order to increase the variety of mappings that SOMs can learn, some
of these variants use quite sophisticated learning algorithms. For instance, the
topology of a SOM can be learnt by the neural gas approach [2], or the topology
of the network connecting several SOMs can be learnt by the growing hierarchical
self-organising map (GHSOM) approach [?].

The approach used in this paper aims to achieve a similar type of result to
the GHSOM approach. GHSOM is a top-down coarse-to-fine approach to optimising
a tree structured network of SOMs, whereas in this paper a bottom-up
fine-to-coarse approach will be used that learns a tree structure where
appropriate. The choice of a fine-to-coarse rather than coarse-to-fine approach
is made in order to obtain networks that can be readily applied to data fusion
problems, where the goal is to progressively discard noise (and irrelevant
degrees of freedom) as the data passes along the processing chain, thus
gradually reducing its dimensionality to eventually obtain a low-dimensional
representation of the original raw data.

The basis for the approach used in this paper is a Bayesian theory of SOMs
[3] in which a SOM is modelled as an encoder/decoder pair, where the decoder
is the Bayes inverse of the encoder. In this approach the encoder is modelled as
a conditional probability over all possible codes given the input, and the code
that is actually used is a single sample drawn from this conditional probability
(i.e. a winner-take-all code). When the conditional probability is optimised
to minimise the average Euclidean distortion between the original input and
its reconstruction this leads to a network that has properties very similar to a
Kohonen SOM.

The basic approach [?] needs to be extended in two separate ways [?]. Firstly,
to encourage the self-organisation of a processing chain leading from raw data to 
a higher level representation, the single encoder/decoder is extended to become 
a Markov chain of connected encoders, where each encoder feeds its output into 
the next encoder in the chain. Secondly, to encourage the self-organisation of 
each encoder into a number of separate smaller encoders and thus to learn tree- 
structured networks where appropriate, each encoder is generalised to use codes 
that make simultaneous use of several samples from the conditional probability 
rather than only a single winner-take-all sample. 

The goal of the approach used in this paper is similar to that of the multiple 
cause vector quantisation approach [?], because the common aim is to split data
into its separate components (or causes). However, the approach used in this 
paper aims to minimise the amount of manual intervention in the training of the 
network, and thus allow the structure of the data to determine the structure of 
the network. This is made possible by using codes that consist of several samples 
from a single conditional probability, which allows each encoder to decide for 
itself how to split into a number of separate smaller encoders. Also the approach 
used in this paper does not make explicit use of a generative model of the data,
because the aim is only to map raw data into a representation that clarifies its 
internal structure (i.e. build a recognition model), for which a generative model 
may be sufficient but is actually not necessary. 

This paper is organised as follows. In Section 2 the structure of data is
represented as smooth curved manifolds, and encoders are represented as
hyperplanes that slice through these manifolds. In Section 3 the theory of a
single encoder is developed by extending a Bayesian theory of SOMs [3] from
winner-take-all encoders to multiple output encoders, and this theory is further
extended to Markov chains of connected encoders [?]. In Section 4 these results
are used
to train a network on some hierarchically correlated data to demonstrate the 
self-organisation of a tree-structured network for processing the data. 



2 Data Manifolds 

In order to represent the structure of data a flexible framework needs to be 
used. In this paper an approach will be used in which the structure of the data 
manifold is of primary importance, and the aim is to split apart the manifold 
in such a way as to reveal how its overall structure is composed. This approach 
must take account of the relative amplitude of the various contributions, so 
that a high resolution representation would include even the smallest amplitude 
contributions to the manifold, and a low resolution representation would retain 
only the largest amplitude contributions. More generally, it would be useful 
to construct a sequence of representations, each with a lower resolution than 
the previous one in the sequence. This could be achieved by progressively dis- 
carding the smallest degree of freedom to gradually lower the resolution of the 
representation. In effect, the representation will become increasingly abstract 
as it becomes more and more invariant to the fine details of the original data 
manifold. 

In Section 2.1 the basic notation used to describe manifolds is presented, and
in Section 2.2 the process of splitting a manifold into its component pieces and
then reassembling these to form an approximation to the manifold is described. 

2.1 Representation of Data Manifolds 

Assume that the raw data vector $x$ lies on a smooth manifold $x(u)$,
parameterised by $u$ which is a vector of co-ordinates in the manifold. Usually,
though not invariably, $x$ is a high-dimensional vector (e.g. an image
comprising an array of pixel values) and $u$ is a low-dimensional vector (e.g. a
vector of object positions), in which case the space in which $x$ lives is a
high-dimensional embedding space for a low-dimensional manifold. Typically, $u$
represents the underlying degrees of freedom (e.g. object co-ordinates), whereas
$x$ represents the observed degrees of freedom (e.g. sensor measurements).
Usually, $u$ will contain some noise degrees of freedom, but these can be
handled in exactly the same way as other degrees of freedom by splitting $u$ as
$u = (u_s, u_n)$ where $u_s$ is signal and $u_n$ is noise. The probability
density function (PDF) $\Pr(u)$ describes how the manifold is populated and
$\Pr(x)$ (where $\Pr(x) = \int du\,\Pr(u)\,\delta(x - x(u))$) describes how the
embedding space is populated.

In general, $x(u)$ is a non-linear function of $u$ so the manifold is curved,
and thus occupies more linear dimensions of the embedding space than would be
the case if the manifold were not curved. It is commonplace for a 1-dimensional
manifold (i.e. $u$ is a scalar) to be curved so as to occupy all of the linear
dimensions of the embedding space (e.g. the manifold of images generated by
moving an object along a 1-dimensional line of positions).

If $x(u)$ can be written as $x(u) = (x_1(u_1), x_2(u_2))$, where $x_1(u_1)$ and
$x_2(u_2)$ are independently parameterised manifolds living in separate
subspaces of the embedding space (where $\dim x = \dim x_1 + \dim x_2$ and
$\dim u = \dim u_1 + \dim u_2$), then $x(u)$ describes a tensor product of
manifolds as shown in Figure 1a. This type of manifold arises when the
underlying degrees of freedom are measured by separate sensors. This
parameterisation can readily be generalised to
$x(u) = (x_1(u_1), x_2(u_2), \ldots, x_k(u_k))$ for $k > 2$. For independently
populated manifolds $\Pr(u)$ factorises as $\Pr(u) = \Pr(u_1)\Pr(u_2)$, and if
$x(u) = (x_1(u_1), x_2(u_2))$ then $\Pr(x) = \Pr(x_1)\Pr(x_2)$ where
$\Pr(x_i) = \int du_i\,\Pr(u_i)\,\delta(x_i - x_i(u_i))$ for $i = 1, 2$. For
manifolds that are populated in a correlated way (i.e.
$\Pr(u) \neq \Pr(u_1)\Pr(u_2)$) no such simple result holds.

Figure 1: Examples of manifolds generated by images of a pair of objects. In
each of the two diagrams the upper half shows the sensor data, and the lower half
shows the low-dimensional manifold topology assuming that the sensor data have
circular wraparound. The high-dimensional manifold geometry has a number
of dimensions equal to the number of pixels in the corresponding sensor. (a)
Tensor product of manifolds: 2-torus topology. This is generated by observing
each object using a separate sensor. (b) Superposition (or mixture) of manifolds:
This is generated by observing both objects using the same sensor, so that there
is the possibility of overlap (possibly with obscuration) of the sensor data from
the two objects. In the limit where the objects overlap infrequently case (b)
closely approximates case (a).

If $x(u)$ can be written as $x(u) = x_1(u_1) + x_2(u_2)$ (where
$\dim x = \dim x_1 = \dim x_2$), then $x(u)$ describes a superposition (or
mixture) of manifolds as shown in Figure 1b. This type of manifold arises when
the underlying degrees of freedom are simultaneously measured by the same
sensor. If there is little or no overlap between $x_1(u_1)$ and $x_2(u_2)$ then
this is approximately equivalent to the case $x(u) = (x_1(u_1), x_2(u_2))$
(where $\dim x = \dim x_1 + \dim x_2$ and $\dim u = \dim u_1 + \dim u_2$). On
the other hand, where there is a significant amount of overlap so that
$x_1(u_1) \cdot x_2(u_2) > 0$, there is no such correspondence. Assuming
$\Pr(u) = \Pr(u_1)\Pr(u_2)$ then $\Pr(x)$ is given by
$\Pr(x) = \int du_1\,du_2\,\Pr(u_1)\Pr(u_2)\,\delta(x - x_1(u_1) - x_2(u_2))$.

More generally, $x(u)$ can be written as $x(u) = x(u_1, u_2)$ where
$x(u_1, u_2)$ has no special dependence on $u_1$ and $u_2$. Although the
manifolds are independently parameterised by $u_1$ and $u_2$, when they are
mapped to $x$ their tensor product structure is disguised by the mapping
function $x(u_1, u_2)$ which is usually not invertible. The superposition of
manifolds $x(u) = x_1(u_1) + x_2(u_2)$ is a special case of this effect.
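The distinction between the tensor product and the superposition of manifolds
can be illustrated with a small numerical sketch. The Gaussian object profile,
the ring of 16 pixels, and the function names below are illustrative assumptions
rather than anything specified in this section.

```python
import numpy as np

def object_image(position, n_pixels=16, width=1.0):
    """Sampled Gaussian-shaped object on a ring of pixels (circular wraparound).

    The Gaussian profile and the default sizes are assumptions for illustration.
    """
    pixels = np.arange(n_pixels)
    # Shortest signed distance on the ring between each pixel and the object.
    d = (pixels - position + n_pixels / 2) % n_pixels - n_pixels / 2
    return np.exp(-d**2 / (2 * width**2))

def tensor_product(u1, u2, n_pixels=16):
    """Each object seen by its own sensor: x(u) = (x1(u1), x2(u2))."""
    return np.concatenate([object_image(u1, n_pixels), object_image(u2, n_pixels)])

def superposition(u1, u2, n_pixels=16):
    """Both objects seen by the same sensor: x(u) = x1(u1) + x2(u2)."""
    return object_image(u1, n_pixels) + object_image(u2, n_pixels)

x_tp = tensor_product(3.0, 11.0)   # dim x = dim x1 + dim x2
x_sp = superposition(3.0, 11.0)    # dim x = dim x1 = dim x2

# Overlap x1(u1) . x2(u2): negligible for well separated objects,
# significant for neighbouring objects.
overlap_far = object_image(3.0) @ object_image(11.0)
overlap_near = object_image(3.0) @ object_image(4.0)
```

When the overlap is negligible the superposition carries essentially the same
information as the tensor product, which is the approximate equivalence noted
above.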

2.2 Mapping of Data Manifolds 

Given examples of the raw data $x$, how can an approximation to the mapping
function $x(u)$ be constructed? The detailed approach will be described in
Section 3, but the basic geometric ideas will be described here. The basic idea is
to cut the manifold into pieces whilst retaining only a limited amount of infor- 
mation about each piece, and then to reassemble these pieces to reconstruct an 
approximation to the manifold. This process is imperfect because it is disrupted 
by discarding some of the information about each piece, so the reconstructed 
manifold is not a perfect copy of the original manifold. This loss of informa- 
tion is critical to the success of this process, because if perfect information were 
retained then there would be no need to discover a clever way of cutting the 
manifold into pieces, and thus no possibility of discovering the structure of the 
manifold (e.g. whether it is a simple tensor product). The information that 
is preserved depends on exactly how the curved manifold is mapped to a new 
representation (see Section 3 for details).





Figure 2: Using hyperplanes to slice pieces off convex curved manifolds. Slicing
a curved manifold into pieces prepares it for mapping to another representation.
(a) 1-dimensional manifold with arcs being sliced off by chords. (b)
2-dimensional manifold with caps being sliced off by planes (only a few of these
are shown in order to keep the diagram simple).



Figure 2a shows an example of how a convex 1-dimensional manifold can be
cut into overlapping pieces by a set of lines, and Figure 2b shows the
generalisation to the 2-dimensional case. This process is considerably
simplified if the manifold is convex because then the hyperplane slices off a
localised piece of the manifold, as required.

Figure 3: Manifolds generated by a 1-dimensional object. The data vector is
$x = (\ldots, x_{-1}, x_0, x_1, \ldots)$ where
$x_i = \exp\left(-\frac{(i-a)^2}{2\sigma^2}\right)$, $\sigma$ is the width
of the object function, $a$ ($-\infty < a < \infty$) is the position of the
object, and $i$ ($i = 0, \pm 1, \pm 2, \ldots$) is the location of the points
where the object amplitude is sampled. The manifolds shown are 3-dimensional
embeddings $(x_1, x_2, x_3)$ of the 1-dimensional curved manifolds generated as
$a$ varies for a variety of object widths $\sigma$. For $\sigma = 0.25$ (i.e. a
narrow object function) the manifold is concave with cusps, as $\sigma$ is
increased the concavity and the cusps become less pronounced until the manifold
crosses the border between being concave and being convex, and for $\sigma = 1$
(i.e. a wide object function) the manifold is smoothly convex. Concave manifolds
with cusps are not well suited to being sliced apart by hyperplanes whereas
smoothly convex manifolds are well suited, and this type of convex manifold is
typical of high-dimensional data which also has a high resolution so that each
object covers several sample points.
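The construction described in the caption of Figure 3 can be sketched
numerically as follows. The Gaussian form with $2\sigma^2$ in the denominator,
the finite window of sample points, and the helper names are assumptions made
for illustration. The variation of the squared norm with object position is
used here only as a simple symptom of undersampling; it is not a test of
convexity.

```python
import numpy as np

def object_vector(a, sigma, half_width=10):
    """Samples x_i of a Gaussian object at integer locations i = -10..10.

    Assumes x_i = exp(-(i - a)^2 / (2 sigma^2)); the finite window is an
    approximation to the infinite array of sample points.
    """
    i = np.arange(-half_width, half_width + 1)
    return np.exp(-((i - a) ** 2) / (2 * sigma**2))

def embedding_3d(sigma, n_positions=201):
    """Curve traced out in the (x_1, x_2, x_3) coordinates as a varies."""
    positions = np.linspace(0.0, 4.0, n_positions)
    # With half_width=10 the samples i = 1, 2, 3 sit at array indices 11..13.
    return np.array([object_vector(a, sigma)[11:14] for a in positions])

def norm_variation(sigma, n_positions=201):
    """Relative variation of |x|^2 as the object moves between sample points."""
    positions = np.linspace(-1.0, 1.0, n_positions)
    sq_norms = np.array([np.sum(object_vector(a, sigma) ** 2) for a in positions])
    return (sq_norms.max() - sq_norms.min()) / sq_norms.mean()

narrow = norm_variation(0.25)   # undersampled: |x|^2 changes strongly with a
wide = norm_variation(1.0)      # oversampled: |x|^2 is almost constant
curve = embedding_3d(1.0)
```

Plotting the rows of `embedding_3d(sigma)` for several widths gives curves of
the kind the caption describes.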

Figure 3 shows an example of how the convexity assumption can break down.
The full embedding space contains the vector formed from an array of samples of
a 1-dimensional object function, but only three dimensions of the full embedding
space are shown in Figure 3. Several scenarios are shown ranging from a narrow
object (i.e. undersampled) to a broad object (i.e. oversampled). Oversampling
leads to a smooth convex manifold, whereas undersampling leads to a concave
manifold with cusps. Typically, convex manifolds occur in signal and image
processing where the raw data are sampled at a high enough rate, and non-convex
manifolds occur when the raw data has already been processed into a
low-dimensional form, such as when some underlying degrees of freedom (or
features) have already been extracted from the raw data.

Figure 4: Using a stochastic vector quantiser (SVQ) to map a curved manifold
to a new representation (see Section 3 for details). The manifold is the unit
circle which is softly sliced up by the SVQ posterior probabilities, which are
defined as the (normalised) outputs of a set of sigmoid functions, which in turn
depend on a set of weight vectors and biases. For each sigmoid function a dashed
line is drawn to show where its (unnormalised) output is $\frac{1}{2}$, although
here it is the curved contours of the posterior probabilities (rather than the
dashed lines) that are actually used to slice up the manifold.

Figure 4 shows the results obtained for a circular manifold using the
stochastic vector quantiser (SVQ) approach of Section 3. The results correspond
to Figure 2a, except that now the slicing is done softly in order to preserve
additional information about the manifold, and to ensure that the reconstructed
manifold does not show artefacts when the slices are reassembled.

Figure 5 shows an example of how a convex $(1 + \epsilon)$-dimensional manifold
(a small length extracted from a cylindrical surface) can be cut into
overlapping pieces by a set of planes. The 1 in $(1 + \epsilon)$ is a large
degree of freedom (arc length around the cylinder), whereas the $\epsilon$ in
$(1 + \epsilon)$ is a small degree of freedom (length along the cylinder)
because it has a small amplitude compared to the large degree of freedom.
Because of the orientation of the planes they are insensitive to the small
degree of freedom, so the reconstructed manifold is a 1-dimensional manifold
(i.e. the $\epsilon$ component has been discarded). The orientation of the
planes may be used in various ways to control their sensitivity to the manifold,
and some quite sophisticated examples of this will be discussed in Section 4.
These results generalise to soft slicing as used in Figure 4.

Figure 5: Using hyperplanes to slice pieces off a curved manifold. The manifold
is 2-dimensional with a large and a small degree of freedom. Each hyperplane
slices through the manifold in such a way that it cuts off a piece of the manifold
that has a limited range of values of the large degree of freedom but all possible
values of the small degree of freedom. This is the basic means by which a
manifold can be mapped to a new representation.
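The insensitivity of suitably oriented planes to the small degree of freedom
can be checked with a short sketch. The cylinder radius, the value of `eps`,
the number of planes, and the plane offset are hypothetical choices made for
illustration; only the geometric idea is taken from the text.

```python
import numpy as np

eps = 0.05  # amplitude of the small degree of freedom (hypothetical value)

def cylinder_point(theta, z):
    """Point on a short cylinder: large d.o.f. theta, small d.o.f. z in [-1, 1]."""
    return np.array([np.cos(theta), np.sin(theta), eps * z])

# Planes whose normals are perpendicular to the cylinder axis, so that the
# position along the axis cannot influence which side of a plane a point is on.
n_planes = 8
angles = 2 * np.pi * np.arange(n_planes) / n_planes
normals = np.stack([np.cos(angles), np.sin(angles), np.zeros(n_planes)], axis=1)

# Each plane cuts off an arc of half-angle 1.5 * pi / n_planes, which is wider
# than half the spacing between planes, so neighbouring pieces overlap.
offset = np.cos(1.5 * np.pi / n_planes)

def sliced_by(x):
    """Boolean vector marking the planes that cut off the piece containing x."""
    return normals @ x > offset

code_low = sliced_by(cylinder_point(0.3, -1.0))
code_high = sliced_by(cylinder_point(0.3, 1.0))
```

The two codes agree because they differ only in the small degree of freedom,
so a reconstruction built from such codes can recover the arc length but not
the position along the cylinder.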

3 Learning a Manifold 

In order to learn how to represent the structure of a data manifold a flexi- 
ble framework needs to be used. In this paper an approach will be used in 
which the manifold is mapped to a lower resolution representation in such a 
way that a good approximation to the original manifold can be reconstructed. 
Key requirements are that these mappings can be cascaded to form sequences 
of representations of progressively lower resolution having more and more in- 
variance with respect to details in the original manifold, and that the mappings 
can learn to represent tensor products of manifolds so that the representation 
of the manifold can split into separate channels. To achieve this it is sufficient
to use a variant [?] of the standard vector quantiser [2] to gradually compress
the data.

In Section 3.1 the theory of stochastic vector quantisers (SVQ) is presented,
and in Section 3.2 it is extended to chains of linked SVQs.

3.1 Stochastic Vector Quantiser 

As was discussed in Section 2 a procedure is needed for cutting a manifold
into pieces and then reassembling these pieces to reconstruct the manifold. It
turns out that all of the required properties emerge automatically from vector
quantisers (VQ), and their generalisation to stochastic vector quantisers (SVQ).

Figure 6: A matched encoder/decoder pair represented as a folded Markov chain
(FMC) $x_0 \to x_1 \to x_1' \to x_0'$. The input $x_0$ is encoded as $x_1$ which
is then passed along a distortionless communication channel to become $x_1'$
which is then decoded as $x_0'$. The encoder is modelled using the conditional
probability $\Pr(x_1|x_0)$ to allow for the possibility that the encoder is
stochastic, and the corresponding decoder is modelled using the Bayes inverse
conditional probability $\Pr(x_0'|x_1')$. The distortionless communication
channel is modelled using the delta function $\delta(x_1 - x_1')$.

Figure 6 shows a folded Markov chain (FMC) as described in [?]. An FMC
encodes its input $x_0$ (i.e. cuts the input manifold into pieces using the
conditional PDF $\Pr(x_1|x_0)$) and then reconstructs an approximation to its
input $x_0$ (i.e. reassembles the pieces to reconstruct the input manifold using
the Bayes inverse PDF
$\Pr(x_0|x_1) = \frac{\Pr(x_1|x_0)\Pr(x_0)}{\int dx_0'\,\Pr(x_1|x_0')\Pr(x_0')}$),
so it is ideally suited to the task at hand. An objective function $D$ needs to
be defined to measure how accurately the reconstruction $x_0'$ approximates the
original input $x_0$.

It is simplest to use a Euclidean objective function that measures the average
squared (i.e. $L_2$) distance $\|x_0 - x_0'\|^2$, and which must be minimised
with respect to the encoder $\Pr(x_1|x_0)$ (note that the decoder
$\Pr(x_0'|x_1')$ is then completely determined by Bayes' theorem):

$$
D = \int dx_0\,dx_1\,dx_1'\,dx_0'\;\Pr(x_0)\,\Pr(x_1|x_0)\,\delta(x_1 - x_1')\,\Pr(x_0'|x_1')\,\|x_0 - x_0'\|^2 \tag{1}
$$



Using Bayes' theorem Equation 1 can be manipulated into the form [?]

$$
D = 2 \int dx_0\,dx_1\;\Pr(x_0)\,\Pr(x_1|x_0)\,\|x_0 - x_0'(x_1)\|^2 \tag{2}
$$

where $D$ must be minimised with respect to both the encoder $\Pr(x_1|x_0)$ and
the reconstruction vector $x_0'(x_1)$. Note that this simplification of
Equation 1 into Equation 2 depends critically on the Euclidean form of the
objective function.

Figure 7: An encoder/decoder pair represented as the chain
$x_0 \to x_1 \to x_0'(x_1)$. This contains only those parts of the FMC that
affect the Euclidean distortion objective function.

Figure 7 is a transformed version of Figure 6 that reflects the transformation
of Equation 1 into Equation 2. The encoder $\Pr(x_1|x_0)$ is the most important
part of this diagram, whereas the reconstruction vector $x_0'(x_1)$ is less
important so it is shown as a dashed line.
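The step from the folded chain to the simplified chain can be sketched as
follows. This is a reconstruction of the argument from Bayes' theorem and the
Euclidean form of the distortion, not a quotation of the cited derivation, and
the symbol $\mu(x_1)$ for the posterior mean is notation introduced here.

```latex
\begin{align*}
D &= \int dx_0\,dx_1\,dx_0'\;\Pr(x_0)\,\Pr(x_1|x_0)\,\Pr(x_0'|x_1)\,\|x_0 - x_0'\|^2
   && \text{(delta function removes } x_1' \text{)} \\
  &= \int dx_1\,\Pr(x_1) \int dx_0\,dx_0'\;\Pr(x_0|x_1)\,\Pr(x_0'|x_1)\,\|x_0 - x_0'\|^2
   && \text{(Bayes' theorem)} \\
  &= 2 \int dx_1\,\Pr(x_1) \int dx_0\;\Pr(x_0|x_1)\,\|x_0 - \mu(x_1)\|^2,
   \qquad \mu(x_1) \equiv \int dx_0\,\Pr(x_0|x_1)\,x_0 \\
  &= \min_{x_0'(x_1)}\; 2 \int dx_0\,dx_1\;\Pr(x_0)\,\Pr(x_1|x_0)\,\|x_0 - x_0'(x_1)\|^2 .
\end{align*}
```

The third line uses the fact that the mean squared distance between two
independent samples from the same distribution is twice the mean squared
distance of one sample from the mean, which holds only for the Euclidean
distortion. The last line uses the fact that the mean is the point that
minimises the average squared distance, so treating the reconstruction vector
as a free parameter to be optimised recovers the same value of $D$.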

Thus far a non-parametric representation of $\Pr(x_1|x_0)$ and $x_0'(x_1)$ has
been used, so analytic minimisation of $D$ [3] leads to
$\Pr(x_1|x_0) \to \delta(x_1 - x_1(x_0))$, in which case the encoder could be
perfect (i.e. lossless) and it would not be possible to discover the structure
of the input manifold, as discussed in Section 2. To make progress, constrained
forms of $\Pr(x_1|x_0)$ and $x_0'(x_1)$ must be used in order to limit the
resources available to the encoder/decoder, and thus force it to discover clever
ways of mapping the input manifold to reduce the damage caused by having only
limited coding resources.

One way of constraining the encoder/decoder is for $x_1$ to be a scalar index
$y_1$ where $y_1 = 1, 2, \ldots, m_1$ ($m_1$ is the size of the code book), which
is a single sample drawn from the encoder $\Pr(x_1|x_0)$. Analytic minimisation
of $D$ [?] now leads to $\Pr(x_1|x_0) \to \delta_{y_1, y_1(x_0)}$ so that
$D = 2 \int dx_0\,\Pr(x_0)\,\|x_0 - x_0'(y_1(x_0))\|^2$, which is the objective
function for a standard least squares vector quantiser [?].
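For comparison with what follows, the standard least squares vector quantiser
can be sketched with the familiar alternating (Lloyd-style) iteration. This is
a generic textbook procedure substituted here for illustration, not the
training method used in this paper, and the data set, seeds, and iteration
count are arbitrary choices.

```python
import numpy as np

def train_vq(data, m, n_iter=50, seed=0):
    """Alternating optimisation of a winner-take-all encoder and its code book."""
    rng = np.random.default_rng(seed)
    # Initialise the reconstruction vectors with m distinct data points.
    codebook = data[rng.choice(len(data), size=m, replace=False)].copy()
    for _ in range(n_iter):
        # Encoder: y1(x0) is the index of the nearest reconstruction vector.
        d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        winners = d2.argmin(axis=1)
        # Decoder: each reconstruction vector moves to the mean of its inputs.
        for y in range(m):
            members = data[winners == y]
            if len(members) > 0:
                codebook[y] = members.mean(axis=0)
    d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Sample estimate of D = 2 * average squared reconstruction error.
    distortion = 2.0 * d2.min(axis=1).mean()
    return codebook, distortion

theta = np.random.default_rng(1).uniform(0.0, 2.0 * np.pi, size=2000)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
codebook, distortion = train_vq(circle, m=6)
```

Each reconstruction vector ends up inside the unit circle, at the centroid of
the arc that it encodes.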

A better way of constraining the encoder/decoder is for $x_1$ to be the
histogram $(\nu_1, \nu_2, \ldots, \nu_{m_1})$ of counts of independent samples
of the scalar index $y_1$ ($n_1 = \sum_{y_1=1}^{m_1} \nu_{y_1}$ is the total
number of samples), and for $x_0'(x_1)$ to be approximated as
$x_0'(x_1) \approx \sum_{y_1=1}^{m_1} \frac{\nu_{y_1}}{n_1} x_0'(y_1)$ (rather
than using the full functional form $x_0'(\nu_1, \nu_2, \ldots, \nu_{m_1})$).
Although it is still possible to obtain analytic results it usually requires a
lot of calculation [?], and it is generally better to use a numerical
optimisation approach. An upper bound for the objective function $D$ is then
given by [?]

$$
D \leq \frac{2}{n_1} \int dx_0\,\Pr(x_0) \sum_{y_1=1}^{m_1} \Pr(y_1|x_0)\,\|x_0 - x_0'(y_1)\|^2
+ \frac{2(n_1 - 1)}{n_1} \int dx_0\,\Pr(x_0) \left\| x_0 - \sum_{y_1=1}^{m_1} \Pr(y_1|x_0)\,x_0'(y_1) \right\|^2 \tag{3}
$$



where the unconstrained encoder/decoder corresponds to the left hand side of
Equation 3 and the constrained encoder/decoder corresponds to the right hand
side of Equation 3. Note how the random fluctuations in the multiple sample
histogram are analytically summed over in Equation 3 leaving only the single
sample encoder $\Pr(y_1|x_0)$ to be optimised.
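The analytic sum over histogram fluctuations can be checked numerically. The
sketch below evaluates the two-term expression on the right hand side of the
bound for a small set of inputs, and compares it with a Monte Carlo average of
twice the squared reconstruction error when histograms are actually drawn and
decoded with the sample-average reconstruction. The random posteriors and
reconstruction vectors, and all function names, are placeholders chosen for
illustration.

```python
import numpy as np

def svq_upper_bound(x, post, recon, n):
    """Two-term bound for inputs x (N, d), posteriors post (N, m),
    reconstruction vectors recon (m, d) and n samples per histogram."""
    sq_dist = ((x[:, None, :] - recon[None, :, :]) ** 2).sum(axis=2)  # (N, m)
    d1 = (2.0 / n) * (post * sq_dist).sum(axis=1).mean()
    mean_recon = post @ recon                                         # (N, d)
    d2 = (2.0 * (n - 1) / n) * ((x - mean_recon) ** 2).sum(axis=1).mean()
    return d1 + d2

def monte_carlo_distortion(x, post, recon, n, n_draws=20000, seed=0):
    """Average of 2 |x0 - sum_y (nu_y / n) x0'(y)|^2 over sampled histograms."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for x_k, p_k in zip(x, post):
        counts = rng.multinomial(n, p_k, size=n_draws)   # histograms (n_draws, m)
        x_hat = (counts / n) @ recon                     # reconstructions
        total += 2.0 * ((x_k - x_hat) ** 2).sum(axis=1).mean()
    return total / len(x)

rng = np.random.default_rng(42)
angles = rng.uniform(0.0, 2.0 * np.pi, size=5)
x0 = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # 5 inputs on the circle
post = rng.dirichlet(np.ones(6), size=5)                 # placeholder Pr(y1|x0)
recon = rng.normal(size=(6, 2))                          # placeholder x0'(y1)

bound = svq_upper_bound(x0, post, recon, n=20)
estimate = monte_carlo_distortion(x0, post, recon, n=20)
```

The two numbers agree to within sampling error, which is why no Monte Carlo
sampling is needed when the bound is optimised. Setting `n=1` removes the
second term and leaves the winner-take-all objective.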

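A further constraint is to assume that $\Pr(y_1|x_0)$ is parameterised as the
normalised output of a set of sigmoid functions

$$
\Pr(y_1|x_0) = \frac{Q(y_1|x_0)}{\sum_{y_1'=1}^{m_1} Q(y_1'|x_0)}, \qquad
Q(y_1|x_0) = \frac{1}{1 + \exp\left(-w_{10}(y_1) \cdot x_0 - b_1(y_1)\right)} \tag{4}
$$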



where $Q(y_1|x_0)$ is the unnormalised output from code index $y_1$, depending
on the weight vector $w_{10}(y_1)$ and the bias $b_1(y_1)$. This
parameterisation of $\Pr(y_1|x_0)$ ensures that it can be used to slice pieces
off convex manifolds as illustrated in Figure 4. Optimisation of the objective
function is then achieved by gradient descent variation of the three sets of
parameters $w_{10}(y_1)$, $b_1(y_1)$, and $x_0'(y_1)$. These (and other)
derivatives of the objective function were given in [?].
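The normalised sigmoid parameterisation is straightforward to evaluate. In the
sketch below the weight vectors and biases are hand-picked hypothetical values
for a unit circle input manifold, not trained parameters, and the function name
is an assumption.

```python
import numpy as np

def svq_posterior(x, weights, biases):
    """Pr(y1|x0) as normalised sigmoid outputs.

    x: (N, d) inputs, weights: (m, d) weight vectors, biases: (m,) biases.
    """
    q = 1.0 / (1.0 + np.exp(-(x @ weights.T + biases)))  # unnormalised Q(y1|x0)
    return q / q.sum(axis=1, keepdims=True)

m1 = 6
directions = 2.0 * np.pi * np.arange(m1) / m1
# Hypothetical parameters: weight vectors pointing at equally spaced angles.
weights = 5.0 * np.stack([np.cos(directions), np.sin(directions)], axis=1)
biases = np.full(m1, -4.0)

theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
x0 = np.stack([np.cos(theta), np.sin(theta)], axis=1)
post = svq_posterior(x0, weights, biases)   # shape (100, 6), rows sum to 1
```

Each unnormalised output equals one half on the hyperplane where
$w_{10}(y_1) \cdot x_0 + b_1(y_1) = 0$, so with these values each sigmoid
responds most strongly on one arc of the circle.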

The constrained objective function in Equation 3 and Equation 4 yields a
great variety of useful results, such as the simple result shown in Figure 4
which used $m_1 = 6$, $n_1 = 20$, and $x_0 = (\cos\theta, \sin\theta)$ with
$\theta$ uniformly distributed in $[0, 2\pi]$, to learn a mapping from the
1-dimensional input manifold embedded in a 2-dimensional space ($\dim x_0 = 2$)
to a 6-dimensional space ($m_1 = 6$). This is the key objective function that
can be used to optimise the mapping of the input manifold to a new
representation $\Pr(y_1|x_0)$ for $y_1 = 1, 2, \ldots, m_1$. Minimising the
Euclidean distortion ensures that $\Pr(y_1|x_0)$ defines an optimal mapping of
the input manifold, such that when $n_1$ samples are drawn from $\Pr(y_1|x_0)$
they contain enough information to form an accurate reconstruction of $x_0$.

Figure 8 is a transformed version of Figure 7 that shows an example of the
structure of some types of optimal solution that are obtained by minimising
the constrained objective function in Equation 3. The tensor product structure
of the input manifold is revealed in this type of solution, because the input
vector $x_0$ splits into two parts as $x_0 = (x_0^a, x_0^b)$ each of which is
separately encoded/decoded. This type of factorial encoder is favoured by
limiting the size of the code book $m_1$ and by using an intermediate number of
samples $n_1$ ($n_1 = 1$ leads to a standard VQ, and $n_1 \to \infty$ allows too
many coding resources to lead to clever encoding schemes).

Figure 8: A factorial encoder/decoder pair represented as the pair of
disconnected chains $x_0^a \to x_1^a \to x_0^{\prime a}(x_1^a)$ and
$x_0^b \to x_1^b \to x_0^{\prime b}(x_1^b)$. The input vector is
$x_0 = (x_0^a, x_0^b)$ highlighted by the left hand rectangle, the code is
$x_1 = (x_1^a, x_1^b)$ highlighted by the right hand rectangle, and the
reconstruction is $x_0' = (x_0^{\prime a}, x_0^{\prime b})$. The dependencies
amongst the variables are indicated by the arrows in the diagram, which shows
that subspaces $a$ and $b$ are independently encoded/decoded.

The self-organised emergence of factorial encoders is one of the major strengths
of the SVQ approach. It allows the code book to split into two or more separate
smaller code books in a data driven way, rather than having the split hard-wired
into the code book at the outset (e.g. [?]).



3.2 Chain of Stochastic Vector Quantisers 

The encoder/decoder in Figure 7 leads to useful results for mapping the input
data manifold when its operation is constrained in various ways. A much larger
variety of mappings may be constructed if the encoder/decoder is viewed as a
basic module, and then networks of linked modules are used to process the data
[?]. It is simplest to regard this type of network as progressively mapping the
input manifold as it flows through the network modules.

Figure 9 shows a 3-stage chain of linked encoder/decoders of the type shown
in Figure 7. The important part of this diagram is the processing chain which
is the solid line flowing from left to right at the top of the diagram creating
the Markov chain
$x_0 \xrightarrow{\Pr(x_1|x_0)} x_1 \xrightarrow{\Pr(x_2|x_1)} x_2 \xrightarrow{\Pr(x_3|x_2)} x_3$.
The reconstruction vectors $x_{l-1}'(x_l)$ for $l = 1, 2, 3$ are the dashed
lines flowing from right to left.

Figure 9: A 3-stage chain of linked SVQs. The $l$-th encoder is modelled using
the conditional probability $\Pr(x_l|x_{l-1})$, and the corresponding decoder is
modelled using the reconstruction vectors $x_{l-1}'(x_l)$, where each $x_l$ is a
histogram of samples.

The state $x_l$ of layer $l$ of the chain is the histogram
$(\nu_1, \nu_2, \ldots, \nu_{m_l})$ of counts of samples drawn from
$\Pr(y_l|x_{l-1})$ ($n_l = \sum_{y_l=1}^{m_l} \nu_{y_l}$ is the total number of
samples). In numerical implementations $x_l$ is chosen to be the (normalised)
histogram for an infinite number of samples (i.e. the relative frequencies
implied by $\Pr(y_l|x_{l-1})$), and $x_{l-1}'(x_l)$ is chosen to depend on only
a finite number $n_l$ of samples randomly selected from this histogram,
$x_{l-1}'(x_l) \approx \sum_{y_l=1}^{m_l} \frac{\nu_{y_l}}{n_l} x_{l-1}'(y_l)$.
This choice of how to operate the network is not unique but it has the advantage
of simplifying the computations. The infinite number of samples used in $x_l$
ensures that the $x_l$ do not randomly fluctuate, so no Monte Carlo simulations
are required to implement the feed-forward flow through the network. The finite
number of samples $n_l$ used in $x_{l-1}'(x_l)$ leads to exactly the same
objective function as in Equation 3 where the random fluctuations are
analytically summed over, which ensures that each decoder has limited resources
and thus forces the optimisation of the network to discover intelligent ways of
encoding the data. This type of network reduces to a standard way of using a
Markov chain when only a single sample is drawn from each of the
$\Pr(y_l|x_{l-1})$.
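The deterministic feed-forward flow can be sketched as repeated application of
the normalised sigmoid encoder, with each layer state taken to be the vector of
relative frequencies. The random parameters below are untrained placeholders,
the layer sizes are borrowed from the demonstration in Section 4, and the
function names are assumptions.

```python
import numpy as np

def svq_posterior(x, weights, biases):
    """Normalised sigmoid outputs Pr(y_l | x_{l-1}) for a batch of inputs."""
    q = 1.0 / (1.0 + np.exp(-(x @ weights.T + biases)))
    return q / q.sum(axis=1, keepdims=True)

def feed_forward(x0, layers):
    """Propagate x0 along the chain; x_l is the infinite-sample histogram,
    i.e. the posterior probabilities of stage l given x_{l-1}."""
    states = [x0]
    for weights, biases in layers:
        states.append(svq_posterior(states[-1], weights, biases))
    return states

rng = np.random.default_rng(0)
sizes = (8, 16, 8, 4)  # input layer followed by three SVQ stages
layers = [
    (rng.normal(size=(m_out, m_in)), rng.normal(size=m_out))  # placeholders
    for m_in, m_out in zip(sizes[:-1], sizes[1:])
]
x0 = rng.normal(size=(5, sizes[0]))   # batch of 5 placeholder input vectors
states = feed_forward(x0, layers)
```

Because every state after the input is a normalised histogram, no sampling is
involved and the same input always produces the same sequence of states.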

Each stage of the chain corresponds to an objective function of the form
shown in Equation 3 but applied to the $l$-th stage of the chain. The total
objective function is a weighted sum of these individual contributions. This
encourages all of the mappings in the chain to minimise their average Euclidean
reconstruction error, which gives a progressive mapping of the input manifold 
along the processing chain. However, the relative weighting of the later stages 
of the chain must not be too great otherwise they force the earlier mappings 
in the chain to become singular (e.g. all inputs mapped to the same output), 
because the output of a singular mapping can be mapped with little or no more 
contribution to the overall objective function further along the chain. A less 
extreme form of this phenomenon can be used to encourage factorial encoders 
to emerge, because they produce a (normalised) histogram output state that has 
a smaller volume (in the Euclidean sense) than a non-factorial encoder, which 
reduces the size of the contribution to the overall objective function from the 
next stage in the chain. 

If the chain network topology in Figure 9 is combined with the factorial
encoder/decoder property of SVQs shown in Figure 8 then all acyclic network
topologies are possible. This can be seen intuitively because flow through the
chain corresponds to flow along the time-like direction in an acyclic network
(i.e. following the directed links), and multiple parallel branches occur
wherever there is an SVQ factorial encoder in the chain. An example of the
emergence of this type of network topology will be shown in Section 4.

4 Learning a Hierarchical Network 

The purpose of this section is to demonstrate the self-organised emergence of 
a hierarchical network topology starting from a chain-like topology of the type 
shown in Figure 9. For the purpose of this demonstration the raw data must
have an appropriate correlation structure, which will be achieved by generating
the data as a set of hierarchically correlated phases. Thus each data vector
is a 4-dimensional vector of phases $\phi = (\phi_1, \phi_2, \phi_3, \phi_4)$,
where the $\phi_i$ are the leaf nodes of a binary tree of phases, where the
binary splitting rule used is $\phi \to (\phi - \alpha, \phi + \beta)$ with
$\alpha$ and $\beta$ being independently and uniformly sampled
from the interval [0, -1], and the phase of the root node is uniformly distributed 
in the interval $[0, 2\pi]$. This will lead to each of the $\phi_i$ being uniformly distributed
phase variables, thus uniformly occupying a circular manifold. However, because
the $\phi_i$ are correlated due to the way that they are generated by the binary
splitting process, they do not uniformly populate a 4-torus manifold (i.e. tensor
product of 4 circles).
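
The generation procedure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the names `split`, `hierarchical_phases`, `concentration` and the parameter `max_split` (the upper limit of the interval from which $\alpha$ and $\beta$ are drawn) are hypothetical, and the value `max_split=1.0` is chosen purely for demonstration rather than taken from the paper.

```python
import numpy as np

TWO_PI = 2.0 * np.pi

def split(phase, max_split, rng):
    """Binary splitting rule phi -> (phi - alpha, phi + beta)."""
    alpha = rng.uniform(0.0, max_split, size=phase.shape)
    beta = rng.uniform(0.0, max_split, size=phase.shape)
    return phase - alpha, phase + beta

def hierarchical_phases(num_vectors, max_split, seed=0):
    """Leaf phases of a depth-2 binary tree, shape (num_vectors, 4), wrapped to the circle."""
    rng = np.random.default_rng(seed)
    root = rng.uniform(0.0, TWO_PI, size=num_vectors)
    left, right = split(root, max_split, rng)
    phi1, phi2 = split(left, max_split, rng)
    phi3, phi4 = split(right, max_split, rng)
    return np.mod(np.stack([phi1, phi2, phi3, phi4], axis=1), TWO_PI)

def concentration(a, b):
    """Mean resultant length of a - b: near 1 for a narrow band, near 0 for no correlation."""
    return float(np.abs(np.exp(1j * (a - b)).mean()))

# max_split is an assumed value chosen for illustration only.
phases = hierarchical_phases(10000, max_split=1.0)
siblings = concentration(phases[:, 0], phases[:, 1])  # phi_1 and phi_2 share a parent
cousins = concentration(phases[:, 1], phases[:, 2])   # phi_2 and phi_3 share only the root
```

In this sketch the phase difference between siblings is more tightly concentrated than the difference between cousins, which is the hierarchical correlation structure that the network is expected to discover.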

Figure showed an example of what the manifold of a pair of correlated 
variables looks like, with a large degree of freedom (e.g. $\phi_1 + \phi_2$) and a small
degree of freedom (e.g. $\phi_1 - \phi_2$), and the hyperplanes encoding the manifold
in such a way as to discard information about the small degree of freedom (e.g.
$\phi_1 - \phi_2$). In this way the 3-stage chain in Figure El can progressively discard
information about small degrees of freedom in $\phi$, starting with a 4-torus manifold
(non-uniformly populated) and ending up with a circular manifold (uniformly 
populated), as will be seen below. This is the basic idea behind using this type 
of self-organising network for data fusion. 
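
The distinction between the large and small degrees of freedom can be checked numerically. The sketch below uses a hypothetical stand-in pair of phases (a uniformly distributed $\phi_1$ and a $\phi_2$ that differs from it by a small random offset, with the offset range an arbitrary choice); the helper `resultant_length` is likewise not from the paper. A value near 0 means the variable ranges over the whole circle, and a value near 1 means it is confined to a narrow range.

```python
import numpy as np

def resultant_length(angles):
    """Mean resultant length: about 0 for a uniform phase, about 1 for a concentrated one."""
    return float(np.abs(np.exp(1j * angles).mean()))

rng = np.random.default_rng(0)
phi1 = rng.uniform(0.0, 2.0 * np.pi, size=20000)
# Hypothetical correlated partner: phi1 plus a small random offset.
phi2 = phi1 + rng.uniform(0.0, 0.5, size=phi1.size)

large_dof = resultant_length(phi1 + phi2)  # spreads over the whole circle
small_dof = resultant_length(phi1 - phi2)  # confined to a narrow range
```

Discarding $\phi_1 - \phi_2$ therefore loses little, whereas $\phi_1 + \phi_2$ carries most of the variation in the pair.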

Figure 10 shows the co-occurrence matrices of pairs of the $\phi_i$ displayed as
scatter plots. The bands in these plots wrap around circularly and correspond
to the manifold of a pair of correlated variables shown earlier. Because $\phi_1$ and
$\phi_2$ (and also $\phi_3$ and $\phi_4$) lie close to each other in the hierarchy,
$\Pr(\phi_1, \phi_2)$ and $\Pr(\phi_3, \phi_4)$ have a narrower band than
$\Pr(\phi_1, \phi_3)$, $\Pr(\phi_1, \phi_4)$, $\Pr(\phi_2, \phi_3)$ and $\Pr(\phi_2, \phi_4)$.
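
A co-occurrence matrix of this kind can be estimated as a normalised 2-D histogram. The following is a minimal sketch, not the procedure used to produce the figure: the function names, the bin count, and the two stand-in data pairs (with offset ranges 0.5 and 2.0 radians) are all assumptions made for illustration.

```python
import numpy as np

def cooccurrence(phase_a, phase_b, bins=32):
    """Normalised 2-D histogram estimating Pr(phase_a, phase_b) on [0, 2*pi]^2."""
    edges = np.linspace(0.0, 2.0 * np.pi, bins + 1)
    counts, _, _ = np.histogram2d(phase_a, phase_b, bins=[edges, edges])
    return counts / counts.sum()

def occupied_fraction(matrix):
    """Fraction of histogram cells containing at least one sample."""
    return np.count_nonzero(matrix) / matrix.size

rng = np.random.default_rng(0)
two_pi = 2.0 * np.pi
base = rng.uniform(0.0, two_pi, size=20000)
# Stand-in pairs: a smaller offset range mimics phases that are closer in the hierarchy.
close_partner = np.mod(base + rng.uniform(0.0, 0.5, size=base.size), two_pi)
distant_partner = np.mod(base + rng.uniform(0.0, 2.0, size=base.size), two_pi)

narrow = cooccurrence(base, close_partner)
wide = cooccurrence(base, distant_partner)
```

The more weakly correlated pair occupies a larger fraction of the histogram cells, which corresponds to a broader band in the scatter plot.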

Figure 10: Co-occurrence matrices of all pairs of phases. Note that the block
structure is symmetric so each off-diagonal co-occurrence matrix appears twice.
The hierarchical correlations cause $\phi_1$ and $\phi_2$ (and similarly $\phi_3$ and $\phi_4$) to be
more strongly correlated with each other than $\phi_2$ and $\phi_3$.

A 3-stage chain of linked SVQs of the type shown in Figure El is now trained,
where each stage contributes an objective function of the form defined by the
objective function equations given earlier. The sizes $M$ of the 4 network layers are
$M = (8, 16, 8, 4)$. The size of layer 0 (the input layer) is determined by the
dimensionality of the input data, whereas the size of each of the other layers
is chosen to be 4 times the number of phase variables that each is expected to
use in its encoding of the input data, which encourages the progressive removal
of small degrees of freedom from the data as it flows along the chain into ever
smaller layers. The numbers of samples $n$ used for the 3 SVQ stages are
$n = (20, 20, 20)$, which are large enough to allow each SVQ to develop into a
factorial encoder, so that the processing can proceed in parallel along several
paths (which are progressively fused) along the chain. The relative weightings
$\lambda$ assigned to the objective functions contributed by each of the 3 SVQ stages
are $\lambda = (1, 5, 0.1)$, where a large weighting is assigned to the stage 2 SVQ to
encourage the stage 1 SVQ to develop into a factorial encoder, and a small
weighting is assigned to the stage 3 SVQ because the stage 2 SVQ needs no
additional encouragement to develop into a factorial encoder.
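
The layer sizes quoted above follow from simple arithmetic, which the sketch below spells out. The variable names are hypothetical, the counts of phase variables per layer (4, 2, 1) are read off from the description of what each stage is expected to encode, and combining the stage objectives as a weighted sum is an assumption about how the relative weightings enter the overall objective.

```python
INPUT_DIM = 8                  # two components per phase for 4 phases
NODES_PER_PHASE = 4            # multiplier quoted in the text
PHASES_PER_LAYER = (4, 2, 1)   # phase variables each later layer is expected to encode

layer_sizes = (INPUT_DIM,) + tuple(NODES_PER_PHASE * k for k in PHASES_PER_LAYER)
samples_per_stage = (20, 20, 20)
stage_weights = (1.0, 5.0, 0.1)

def overall_objective(stage_objectives, weights=stage_weights):
    """Weighted sum of per-stage objective values (the weighted-sum form is assumed)."""
    return sum(w * d for w, d in zip(weights, stage_objectives))
```

With these settings `layer_sizes` evaluates to `(8, 16, 8, 4)`, matching the sizes used in the experiment.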

The network was trained by gradient descent on the overall network objective
function, using a step size chosen separately for each SVQ stage and
separately for each of the 3 parameter types $w_{l,l-1}(y_l)$, $b_l(y_l)$, and $x_{l-1,l}(y_l)$ in
each SVQ stage (for $l = 1, 2, 3$). All of these step sizes
were chosen to be large at the start of the training schedule, and then gradually
reduced as training progressed, with the relative rate of reduction being chosen
to encourage earlier SVQ stages to converge before later SVQ stages. In
general, different choices of network parameters and training conditions lead to
different types of trained network, and since there is no prior reason for choosing
one particular solution in preference to another the choice must be left up to
the user. All the components of the weight vectors, biases, and reconstruction
vectors are initialised to random numbers uniformly distributed in the interval
$[-0.1, 0.1]$.
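
The initialisation and the shrinking step sizes might be set up along the following lines. This is a hedged sketch rather than the training code used for the experiment: the array shapes assigned to the three parameter types, the exponential form of `step_size`, and the numerical decay rates are all assumptions, since the text specifies only the initialisation interval and the qualitative behaviour of the schedule.

```python
import numpy as np

LAYER_SIZES = [8, 16, 8, 4]

def init_stage(m_in, m_out, rng, scale=0.1):
    """Initialise one SVQ stage with components uniform in [-scale, scale]."""
    return {
        "weights": rng.uniform(-scale, scale, size=(m_out, m_in)),
        "biases": rng.uniform(-scale, scale, size=m_out),
        "reconstructions": rng.uniform(-scale, scale, size=(m_out, m_in)),
    }

def step_size(initial, decay_rate, iteration):
    """Hypothetical exponential schedule: large at the start, then shrinking."""
    return initial * np.exp(-decay_rate * iteration)

rng = np.random.default_rng(0)
stages = [
    init_stage(m_in, m_out, rng)
    for m_in, m_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])
]
# Faster decay for earlier stages, so that they settle first (illustrative values).
decay_rates = [3e-3, 2e-3, 1e-3]
```

A separate `initial` and `decay_rate` could be kept for each parameter type within each stage, which mirrors the separately chosen step sizes described above.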




Figure 11: Reconstruction vectors $x_{l-1,l}(y_l)$ (for $l = 1, 2, 3$) after training a
3-stage network of linked SVQs on the hierarchically correlated phases data.
These diagrams are rotated 90° anticlockwise relative to Figure El so the
processing chain runs from bottom to top of each diagram. Line thickness indicates
the size of a reconstruction vector component, and dashing indicates that the
component is negative. (a) All reconstruction vector components. (b) Largest
reconstruction vector components obtained by applying a threshold to the
magnitude of each component. (c) The same as (b) except that the positions of the
nodes in all layers (other than the input layer) have been permuted (along with
the connections between layers) to make the network topology clearer.

Figure 11 shows the reconstruction vectors $x_{l-1,l}(y_l)$ (for $l = 1, 2, 3$) in a
trained 3-stage chain of linked SVQs of the type shown in Figure El. Reconstruction
vectors are displayed because they are easier to interpret than weight
vectors. The data $\phi = (\phi_1, \phi_2, \phi_3, \phi_4)$ is embedded in an 8-dimensional input
space as $x = (x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8)$ where $(x_{2i-1}, x_{2i}) = (\cos\phi_i, \sin\phi_i)$.
The key diagram is Figure 11c, which shows the largest components of the
reconstruction vectors, and has been reordered to make the hierarchical network
topology clear. Each of the first two stages of this network has learnt to
operate as two or more encoder/decoders (i.e. a factorial encoder/decoder) as in
Figure |H1. The first stage of the network breaks into 4 encoder/decoders that
encode each of the $\phi_i$ (see the results in Figure 14 and Figure 15 for
justification of this), the second stage of the network breaks into 2 encoder/decoders
that encode $\phi_1 + \phi_2$ and $\phi_3 + \phi_4$ (see the results in Figure 16 and Figure 17
for justification of this), and the third stage of the network is a single encoder
that encodes $\phi_1 + \phi_2 + \phi_3 + \phi_4$ (see the results in Figure 18 and Figure 19 for
justification of this). The connectivity in the stage 1 SVQ is not the same for all
of the $\phi_i$ because of the interaction between the thresholding prescription used
to create Figure 11b and the different orientation of each of the 4 parts of the
stage 1 factorial encoder with respect to each of the 4 corresponding circular
input manifolds.
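
The embedding of the 4 phases in the 8-dimensional input space, $(x_{2i-1}, x_{2i}) = (\cos\phi_i, \sin\phi_i)$, can be written as a short helper. The function name `embed_phases` is hypothetical, and the code uses 0-based array indices, so $x_1$ is stored at index 0.

```python
import numpy as np

def embed_phases(phases):
    """Map phases of shape (n, 4) to inputs of shape (n, 8) as interleaved (cos, sin) pairs."""
    phases = np.asarray(phases, dtype=float)
    x = np.empty((phases.shape[0], 2 * phases.shape[1]))
    x[:, 0::2] = np.cos(phases)  # x_1, x_3, x_5, x_7
    x[:, 1::2] = np.sin(phases)  # x_2, x_4, x_6, x_8
    return x

example = embed_phases([[0.0, np.pi / 2, np.pi, 3 * np.pi / 2]])
```

Each (cos, sin) pair lies on a unit circle, so every input component lies in the range $[-1, 1]$.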

Although the network in Figure 11 computes using continuous-valued numbers,
the thresholded reconstruction vectors in Figure 11b may be inspected to
reveal the symbolic logic expressions that approximate to each of the
(thresholded) outputs $O_i(x)$ for $i = 1, 2, 3, 4$ from the highest layer of the network
(using logical negation $\bar{x}_i$ to denote $-x_i$ because the inputs lie in the range
$[-1, 1]$).

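$$
\begin{aligned}
O_1(x) &= x_2 \cap x_3 \cap x_5 \cap x_8 \\
O_2(x) &= x_2 \cap x_3 \cap x_5 \cap x_8 \\
O_3(x) &= x_1 \cap x_4 \cap x_6 \cap x_7 \\
O_4(x) &= x_1 \cap x_4 \cap x_6 \cap x_7
\end{aligned}
\tag{5}
$$

The overbars that mark the negated inputs in each expression are not recoverable
from this copy, so only the inputs involved in each output are shown; as far as
the surviving indices show, $O_1$ and $O_2$ (and likewise $O_3$ and $O_4$) involve
the same inputs and differ in which of them are negated.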

In order to keep these expressions short they use a slightly higher threshold
than was used to create Figure 11b, because this suppresses some of the
reconstruction vector components linked to $(x_1, x_2, x_3, x_4)$. In this particularly
simple example it is possible to obtain very short symbolic expressions, but
more generally continuous-valued computations would be needed to obtain good
approximations to the network outputs.
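
The step from thresholded components to a symbolic expression can be illustrated with a small helper. This is a sketch only: the vector `effective` is invented for the example and is not taken from the trained network, the function name `symbolic_terms` is hypothetical, and the composition of connections across the three stages of the chain is not modelled here. A `~` prefix stands for logical negation (a negative component).

```python
import numpy as np

def symbolic_terms(vector, threshold):
    """Join the inputs whose magnitude exceeds the threshold, with '~' marking negation."""
    terms = []
    for index, value in enumerate(vector, start=1):
        if abs(value) > threshold:
            terms.append(("" if value > 0 else "~") + f"x{index}")
    return " & ".join(terms)

# Hypothetical connection strengths from the 8 inputs to one output node.
effective = np.array([0.05, 0.9, -0.8, 0.1, 0.7, -0.02, 0.03, -0.85])
expression = symbolic_terms(effective, threshold=0.5)
```

Raising the threshold removes further terms from the expression, which is the effect exploited in the text to keep the expressions short.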

Figure 12 shows some examples of the node activities $\Pr(y_l | x_{l-1})$ (for $l =
1, 2, 3$) in the trained network shown in Figure 11c. The individual pieces of
each factorial encoder are indicated by the boxes, and the patterns of activity 
are such that every box contains one or more active nodes, as would be expected 
if each box were acting as a separate encoder. 

Figure 13 shows simplified versions of the two types of encoder/decoder that
occur in Figure 11, overlaid on a co-occurrence matrix of the type shown in
Figure 10.

Figure 14 and Figure 15 show the encoders that occur in stage 1 of Figure
11. These are all factorial encoders of the type shown in Figure 13b, as can be
seen from the orientation of the response regions for the various nodes which
cuts across the band of the co-occurrence matrix in the same way as in Figure
13b. This corresponds to the connectivity seen in Figure 11c where each $\phi_i$ has
its own encoder.

Figure 16 and Figure 17 show the encoders that occur in stage 2 of Figure
11. These are invariant encoders of the type shown in Figure 13a, as can be seen
from the orientation of the response regions for the various nodes which cuts
across the band of the co-occurrence matrix in the same way as in Figure 13a.
This corresponds to the connectivity seen in Figure 11b where each of $\phi_1 + \phi_2$
and $\phi_3 + \phi_4$ has its own encoder.

Figure 18 and Figure 19 show the encoder that occurs in stage 3 of Figure 11.
This is an invariant encoder of the type shown in Figure 13a, which corresponds
to the connectivity seen in Figure 11c.

The diagrams in this section show how a 3-stage chain of linked SVQs of
the type shown in Figure JHl self-organises to process hierarchically correlated
phase data. Stage 1 makes an approximate copy of the data where each phase
is separately encoded, then stage 2 encodes the output of stage 1 discarding the
smallest degrees of freedom, and finally stage 3 encodes the output of stage 2
discarding the next smallest degree of freedom. The chain of SVQs has thus split
itself into a hierarchical network of linked encoders that is optimally matched
to the task of mapping from the original data at the input to the chain to
the compressed representation at the output of the chain. This is true
self-organisation of multiple encoders, unlike the hard-wiring of encoders that is
used in other approaches (e.g. 0).

Figure 12: Some typical examples of node activities $\Pr(y_l | x_{l-1})$ (for $l = 1, 2, 3$)
in the trained 3-stage network of linked SVQs. The permuted version of the
network is used to make the results easier to interpret. The area of each filled
circle is proportional to the activity it represents, and the negative values that
occur in the input layer are represented by unfilled circles. The hierarchical
structure of the network is indicated by drawing a box around each part of each
network layer that acts as a separate encoder, so that typically there are one or
two active nodes within every box.

Figure 13: Two ways of encoding a pair of correlated phases $\phi_1$ and $\phi_2$. The
co-occurrence matrix of $\phi_1$ and $\phi_2$ is represented as a narrow band that is
populated by data points, so that the correlation manifests itself as $\phi_1 \approx \phi_2$.

(a) This shows how an invariant encoder operates, in which the response region
of each node is oriented so that it has high resolution for $\phi_1 + \phi_2$ but is
completely insensitive to $\phi_1 - \phi_2$. This does not encode information about the
small degree of freedom measured across the band of the co-occurrence matrix.

(b) This shows how a factorial encoder operates, in which the response region
of each node is highly anisotropic, with high resolution for one of the phases
but completely insensitive to the other phase. Accurate encoding is achieved
by using the nodes in pairs with orthogonally intersecting response regions, as
shown in the example highlighted in the diagram.

Figure 14: Node activities in layer 1 as a function of the inputs $\phi_1$ and $\phi_2$
(with $\phi_3 = \phi_4 = 0$), which has the properties of a factorial encoder. In each plot
the contours representing the node response are overlaid on the co-occurrence
matrix of the pair of inputs $\phi_1$ and $\phi_2$. One of the contour heights is drawn
bold to highlight the region where the node response is large. Half of the nodes
do not respond at all, and the other half split into two subsets of equal size, one
with high resolution in $\phi_1$ but completely insensitive to $\phi_2$, and the other with
high resolution in $\phi_2$ but completely insensitive to $\phi_1$. The response is sensitive
to the small degree of freedom measured across the band of the co-occurrence
matrix. The non-zero responses correspond to the 8 nodes in layer 1 that are
strongly connected to $\phi_1$ and $\phi_2$.

Figure 15: Node activities in layer 1 as a function of the inputs $\phi_3$ and $\phi_4$
(with $\phi_1 = \phi_2 = 0$), which has the properties of a factorial encoder. The
non-zero responses correspond to the 8 nodes in layer 1 that are strongly connected
to $\phi_3$ and $\phi_4$.

Figure 16: Node activities in layer 2 as a function of the inputs $\phi_1$ and $\phi_2$ (with
$\phi_3 = \phi_4 = 0$), which has the properties of an invariant encoder. Half of the
nodes do not respond at all, and the other half respond to well-defined regions
in $\phi_1$ and $\phi_2$. The contours representing the node response are overlaid on the
co-occurrence matrix of the pair of inputs $\phi_1$ and $\phi_2$. One of the contour heights
is drawn bold to highlight the region where the node response is large. This
shows that each node responds to a local region of the populated region of the
co-occurrence matrix, and to a limited extent generalises outside this region.
The response is invariant with respect to the small degree of freedom measured
across the band of the co-occurrence matrix, which demonstrates that layer 2
has acquired an invariance that was absent in layer 1. There are also non-zero
responses in the unpopulated region of the co-occurrence matrix which arise
because the sum of the node activities is normalised. The non-zero responses
correspond to the 4 nodes in layer 2 that are strongly connected to $\phi_1$ and $\phi_2$.

Figure 17: Node activities in layer 2 as a function of the inputs $\phi_3$ and $\phi_4$ (with
$\phi_1 = \phi_2 = 0$), which has the properties of an invariant encoder. The non-zero
responses correspond to the 4 nodes in layer 2 that are strongly connected to
$\phi_3$ and $\phi_4$.

Figure 18: Node activities in layer 3 as a function of the inputs $\phi_1$ and $\phi_2$ (with
$\phi_3 = \phi_4 = 0$), which has the properties of an invariant encoder.

Figure 19: Node activities in layer 3 as a function of the inputs $\phi_3$ and $\phi_4$ (with
$\phi_1 = \phi_2 = 0$), which has the properties of an invariant encoder.

5 Conclusions

This paper has shown how it is possible to map a data manifold into a sim- 
pler form by progressively discarding small degrees of freedom. This is the 
key to self-organising data fusion, where the raw data is embedded in a very 
high-dimensional space (e.g. the pixel values of one or more images), and the 
requirement is to isolate the important degrees of freedom which lie on a low- 
dimensional manifold. A useful advantage of the approach used in this paper is 
that it assumes only that the mapping from manifold to manifold is organised 
in a chain-like topology, and that all the other details of the processing in each 
stage of the chain are to be learnt by self-organisation. The types of application 
for which this approach is well-suited are ones in which separation of small and 
large degrees of freedom is desirable. For instance, separation of targets (small) 
and jammers (large) is relatively straightforward using this approach [10].

Data that is not embedded in a higher-dimensional space is usually not 
suitable for processing with the approach used in this paper. For instance, 
categorical (or symbolic) data that has one of only a few possible states is not
suitable, but a smoothly variable array of pixel values is suitable. This type 
of network is intended to operate on raw sensor data rather than pre-processed 
data, and typically will use high-dimensional intermediate representations in its 
processing chain. Because of its ability to compress the raw data into a much 
simpler form, this type of network would typically be used as a bridge between 
the sub-symbolic raw sensor data and the symbolic higher level representation 
of that data. 

Although the full connectivity between adjacent layers of the chain implies 
that the computations can be expensive, after some initial training the factorial 
structure of the various encoders becomes clear and can be used to prune the 
connections to keep only the ones that are actually used (i.e. usually only
a small proportion of the total number). When the chain is fully trained each
code index typically depends on only a small number of contributing inputs (i.e. 
a receptive field) from the previous stage of the chain. Furthermore, because of 
the normalisation used in each layer, the size and shape of the receptive fields 
mutually interact (i.e. there is a fixed total amount of activity in each layer), 
so the raw receptive fields (i.e. as defined by the feed-forward network weights) 
are different from the renormalised receptive fields (i.e. after taking account of 
normalisation). 

The network described in this paper passes information along the process- 
ing chain in a deterministic fashion, because it uses (hypothetical) histograms 
containing an infinite number of samples, which thus do not randomly fluctuate.
This was done for computational convenience (i.e. to avoid Monte Carlo
simulations) and is not a fundamental limitation of the approach used. With 
additional computational effort it is possible to operate the network as a (non- 
deterministic) Markov chain in which the histograms contain only a finite num- 
ber of samples, which therefore randomly fluctuate and explore network states 
in the vicinity of the deterministic state used in this paper. 
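
The difference between the deterministic (infinite-sample) histograms and their finite-sample counterparts can be illustrated as follows. This is a minimal sketch and not the network itself: the vector `activity` is a hypothetical set of node activities, and `sampled_histogram` simply draws code indices independently from it.

```python
import numpy as np

def sampled_histogram(probabilities, num_samples, rng):
    """Normalised histogram of num_samples code indices drawn from the given probabilities."""
    counts = rng.multinomial(num_samples, probabilities)
    return counts / num_samples

rng = np.random.default_rng(0)
activity = np.array([0.5, 0.3, 0.15, 0.05])  # hypothetical node activities
coarse = sampled_histogram(activity, 20, rng)      # fluctuates from draw to draw
fine = sampled_histogram(activity, 200000, rng)    # close to the deterministic limit
```

As the number of samples grows the histogram settles onto `activity` itself, which is the deterministic limit used throughout this paper.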